Food safety is one of the most pressing urban public health problems as urban populations expand everywhere. Predicting which restaurants are likely to fail a food inspection is therefore a useful way to protect people from food contamination and avoid potential health hazards. It may be that neighborhoods with certain environmental risk factors drive food inspection failures. It may also be that restaurants with a high propensity to fail an inspection select into neighborhoods that already have high failure rates. This second idea is a form of 'selection bias', and we will use it to motivate the spatial structure features in our predictive modeling.
library(tidyverse)
library(sf)
#install.packages("QuantPsyc")
library(QuantPsyc)
#install.packages("RSocrata")
library(RSocrata)
library(viridis)
library(caret)
#install.packages("spatstat")
library(spatstat)
library(spdep)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)
mapTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 14,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2)
)
}
In the code below, policeDistricts and policeBeats are downloaded and mapped.
policeDistricts <-
st_read("https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON") %>%
st_transform(crs=102271) %>%
dplyr::select(District = dist_num)
## Reading layer `OGRGeoJSON' from data source `https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON' using driver `GeoJSON'
## Simple feature collection with 25 features and 2 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -87.94011 ymin: 41.64455 xmax: -87.52414 ymax: 42.02303
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
policeBeats <-
st_read("https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON") %>%
st_transform(crs=102271) %>%
dplyr::select(District = beat_num)
## Reading layer `OGRGeoJSON' from data source `https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON' using driver `GeoJSON'
## Simple feature collection with 277 features and 4 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -87.94011 ymin: 41.64455 xmax: -87.52414 ymax: 42.02303
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
bothPoliceUnits <- rbind(mutate(policeDistricts, Legend = "Police Districts"),
mutate(policeBeats, Legend = "Police Beats"))
ggplot() +
geom_sf(data = bothPoliceUnits) +
facet_wrap(~Legend) +
labs(title = "Police administrative areas, Chicago") +
mapTheme()
chicagoBoundary <-
st_read("/Users/zixiliu/Downloads/riskPrediction_data/chicagoBoundary.shp") %>%
st_transform(crs=102271)
## Reading layer `chicagoBoundary' from data source `/Users/zixiliu/Downloads/riskPrediction_data/chicagoBoundary.shp' using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 1 field
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: 341164.1 ymin: 552875.4 xmax: 367345.4 ymax: 594870
## epsg (SRID): NA
## proj4string: +proj=tmerc +lat_0=36.66666666666666 +lon_0=-88.33333333333333 +k=0.9999749999999999 +x_0=300000 +y_0=0 +ellps=GRS80 +units=m +no_defs
fishnet <-
st_make_grid(chicagoBoundary, cellsize = 500) %>%
st_sf()
ggplot() +
geom_sf(data=chicagoBoundary, fill=NA, colour="black") +
geom_sf(data=fishnet, fill=NA, colour="black") +
labs(title = "Chicago and the fishnet") +
mapTheme()
All the extra grid cells outside of Chicago are not needed. Below, a spatial subset selects only the grid cells that fall inside chicagoBoundary. A grid-cell-specific uniqueID is also created with mutate, which can be used for joins later.
fishnet <-
fishnet[chicagoBoundary,] %>%
mutate(uniqueID = rownames(.)) %>%
dplyr::select(uniqueID)
ggplot() +
geom_sf(data=fishnet) +
labs(title = "Fishnet in Chicago") +
mapTheme()
foodInspections <-
read.socrata("https://data.cityofchicago.org/Health-Human-Services/Food-Inspections-7-1-2018-Present/qizy-d2wf") %>%
filter(Results == "Fail") %>%
mutate(x = gsub("[()]", "", Location)) %>%
separate(x,into= c("Y","X"), sep=",") %>%
mutate(X = as.numeric(X),
Y = as.numeric(Y)) %>%
na.omit %>%
st_as_sf(coords = c("Y", "X"), crs = 4326, agr = "constant")%>%
st_transform(102271) %>%
distinct()
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 24 rows [30,
## 233, 237, 504, 582, 583, 585, 1063, 2318, 2406, 2493, 2729, 2784, 2851,
## 3108, 3154, 3251, 3256, 3393, 3401, ...].
foodInspections
## Simple feature collection with 3112 features and 0 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 334701.1 ymin: 553838 xmax: 367143.4 ymax: 594653.9
## epsg (SRID): 2790
## proj4string: +proj=tmerc +lat_0=36.66666666666666 +lon_0=-88.33333333333333 +k=0.9999749999999999 +x_0=300000 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## First 10 features:
## geometry
## 1 POINT (351667.9 574797.7)
## 2 POINT (358692.9 578868.8)
## 3 POINT (352667.2 572390.2)
## 4 POINT (345417.5 584250.1)
## 5 POINT (355597 579478.3)
## 6 POINT (349846.3 585796.4)
## 7 POINT (357748 574347.8)
## 8 POINT (343633.5 589863)
## 9 POINT (352400.7 584762.8)
## 10 POINT (356306.8 579459.5)
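The Socrata Location field arrives as a text string like "(41.88, -87.63)". Here is a minimal base-R sketch, on made-up coordinates, of what the gsub() and separate() steps above accomplish:

```r
# Hypothetical samples of the Socrata Location string: "(latitude, longitude)"
loc <- c("(41.88, -87.63)", "(41.75, -87.60)")

# Strip the parentheses, then split on the comma -- the same cleaning
# that gsub() + separate() perform in the pipeline above
parts <- strsplit(gsub("[()]", "", loc), ",")
Y <- as.numeric(sapply(parts, `[`, 1))  # latitude
X <- as.numeric(sapply(parts, `[`, 2))  # longitude

Y  # 41.88 41.75
X  # -87.63 -87.60
```

Rows whose Location is missing produce NA after this split, which is why the pipeline warns about "Missing pieces" and follows with na.omit.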
ggplot() +
geom_sf(data = chicagoBoundary) +
geom_sf(data = foodInspections, colour="red", size=0.08, show.legend = "point") +
labs(title= "Food Inspections, Chicago, 2018 - Present") +
mapTheme()
food_net <-
foodInspections %>%
dplyr::select() %>%
mutate(countInspections = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countInspections = ifelse(is.na(countInspections), 0, countInspections),
uniqueID = rownames(.),
cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE))
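The cvID column above assigns each grid cell to one of roughly nrow(fishnet)/24 random folds for later cross-validation. A sketch of what that sample() call does, using an assumed cell count:

```r
set.seed(1)                     # for a reproducible illustration
n_cells <- 2458                 # roughly the number of fishnet cells
n_folds <- round(n_cells / 24)  # target of about 24 cells per fold

# Each cell draws a fold ID uniformly at random, as in the cvID column above
cvID <- sample(n_folds, size = n_cells, replace = TRUE)

length(unique(cvID))  # close to n_folds
mean(table(cvID))     # average cells per fold, roughly 24
```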
ggplot() +
geom_sf(data = food_net, aes(fill = countInspections)) +
scale_fill_viridis() +
labs(title = "Count of Food Inspections for the fishnet") +
mapTheme()
EnvironmentalComplaints <-
read.socrata("https://data.cityofchicago.org/Environment-Sustainable-Development/CDPH-Environmental-Complaints/fypr-ksnz/data") %>%
mutate(year = substr(complaint_date,1,4)) %>%
filter(year >= 2015 ) %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "EnvironmentalComplaints")
OrdinanceViolations <-
read.socrata("https://data.cityofchicago.org/Administration-Finance/Ordinance-Violations/6br9-quuz/data") %>%
mutate(year = substr(hearing_date,1,4)) %>%
filter(year >= 2018 ) %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "OrdinanceViolations")
Business <-
read.socrata("https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses-Current-Active/uupf-x98q") %>%
mutate(year = substr(DATE.ISSUED,1,4)) %>%
filter(year == "2019") %>%
dplyr::select(Y = LATITUDE, X = LONGITUDE) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Business")
## Warning in posixify(result[[columnName]]): Unable to properly format date
## field; formatted as character string.
streetLightsOut <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out/zuxi-7xem") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "StreetLightsOut")
sanitation <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac") %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Sanitation")
liquorRetail <-
read.socrata("https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses-Current-Liquor-and-Public-Places/nrmj-3kcf") %>%
filter(BUSINESS.ACTIVITY == "Retail Sales of Packaged Liquor") %>%
dplyr::select(Y = LATITUDE, X = LONGITUDE) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "LiquorRetail")
## Warning in posixify(result[[columnName]]): Unable to properly format date
## field; formatted as character string.
neighborhoods <-
st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
st_transform(st_crs(fishnet))
## Reading layer `chicago' from data source `https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson' using driver `GeoJSON'
## Simple feature collection with 98 features and 4 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -87.94011 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
ggplot() +
geom_sf(data=chicagoBoundary) +
geom_sf(data = rbind(EnvironmentalComplaints, OrdinanceViolations, Business, streetLightsOut,
liquorRetail, sanitation),
size = .1) +
facet_wrap(~Legend, ncol = 2) +
labs(title = "Risk Factors") +
mapTheme()
vars_net <-
rbind(EnvironmentalComplaints, OrdinanceViolations, Business, streetLightsOut,
liquorRetail, sanitation) %>%
st_join(., fishnet, join=st_within) %>%
st_set_geometry(NULL) %>%
group_by(uniqueID, Legend) %>%
summarize(count = n()) %>%
full_join(fishnet) %>%
spread(Legend, count, fill=0) %>%
st_sf() %>%
dplyr::select(-`<NA>`) %>%
na.omit()
vars_net
## Simple feature collection with 2458 features and 7 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: 341164.1 ymin: 552875.4 xmax: 367664.1 ymax: 594875.4
## epsg (SRID): NA
## proj4string: +proj=tmerc +lat_0=36.66666666666666 +lon_0=-88.33333333333333 +k=0.9999749999999999 +x_0=300000 +y_0=0 +ellps=GRS80 +units=m +no_defs
## # A tibble: 2,458 x 8
## # Groups: uniqueID [2,458]
## uniqueID geometry Business EnvironmentalCo…
## <chr> <POLYGON [m]> <dbl> <dbl>
## 1 1 ((359164.1 552875.4, 359… 0 0
## 2 10 ((363664.1 552875.4, 364… 0 0
## 3 100 ((355664.1 555375.4, 356… 0 0
## 4 1000 ((360164.1 568375.4, 360… 10 0
## 5 1001 ((360664.1 568375.4, 361… 9 0
## 6 1002 ((361164.1 568375.4, 361… 9 0
## 7 1003 ((361664.1 568375.4, 362… 7 1
## 8 1004 ((362164.1 568375.4, 362… 0 0
## 9 1005 ((362664.1 568375.4, 363… 0 0
## 10 1006 ((363164.1 568375.4, 363… 0 0
## # … with 2,448 more rows, and 4 more variables: LiquorRetail <dbl>,
## # OrdinanceViolations <dbl>, Sanitation <dbl>, StreetLightsOut <dbl>
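The group_by/summarize/spread chain above converts a long table of point-in-cell matches into one row per grid cell with one count column per risk factor. A base-R sketch of the same reshape, on hypothetical join output:

```r
# Hypothetical result of the point-to-cell st_join: one row per point
joined <- data.frame(
  uniqueID = c("1", "1", "2", "3", "3", "3"),
  Legend   = c("Business", "Sanitation", "Business",
               "Business", "Business", "Sanitation"))

# Count points per cell and risk factor, then spread Legend wide --
# the base-R analogue of group_by() + summarize() + spread() above
wide <- as.data.frame.matrix(xtabs(~ uniqueID + Legend, joined))
wide
#   Business Sanitation
# 1        1          1
# 2        1          0
# 3        2          1
```

The fill = 0 argument to spread() plays the same role as the implicit zeros in the contingency table: cells with no points of a given type get a count of 0, not NA.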
vars_net.long <-
vars_net %>%
gather(Variable, value, -geometry, -uniqueID)
vars <- unique(vars_net.long$Variable)
mapList <- list()
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange,c(mapList, ncol =2, top = "Risk Factors by Fishnet"))
nn_function <- function(measureFrom, measureTo, k) {
  measureFrom_Matrix <-
    as.matrix(measureFrom)
  measureTo_Matrix <-
    as.matrix(measureTo)
  # Pass the coerced matrices (not the raw inputs) to get.knnx;
  # nn.dist holds one row per origin with k nearest-neighbor distances
  nn <-
    get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.dist
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>%
    dplyr::select(-thisPoint) %>%
    pull()
  return(output)
}
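nn_function() returns, for each origin point, the average distance to its k nearest destination points. A dependency-free sketch of the same calculation on tiny made-up coordinates:

```r
# Three grid-cell centroids (rows) and four risk-factor points, in metres
from <- rbind(c(0, 0), c(10, 0), c(0, 10))
to   <- rbind(c(0, 1), c(0, 2), c(5, 0), c(12, 0))

# Average distance to the k nearest 'to' points, mirroring nn_function()
avg_knn_dist <- function(from, to, k) {
  apply(from, 1, function(p) {
    d <- sqrt(colSums((t(to) - p)^2))  # distances from p to every 'to' point
    mean(sort(d)[1:k])                 # mean of the k smallest
  })
}

avg_knn_dist(from, to, 2)  # 1.5 3.5 8.5
```

Averaging over k = 3 neighbors, as in the calls below, smooths the exposure measure relative to using only the single nearest point.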
vars_net$EnvironmentalComplaints.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(EnvironmentalComplaints), 3)
## Warning in st_centroid.sf(vars_net): st_centroid assumes attributes are
## constant over geometries of x
vars_net$OrdinanceViolations.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(OrdinanceViolations), 3)
## Warning in st_centroid.sf(vars_net): st_centroid assumes attributes are
## constant over geometries of x
vars_net$Business.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(Business), 3)
## Warning in st_centroid.sf(vars_net): st_centroid assumes attributes are
## constant over geometries of x
vars_net$LiquorRetail.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(liquorRetail), 3)
## Warning in st_centroid.sf(vars_net): st_centroid assumes attributes are
## constant over geometries of x
vars_net$StreetLightsOut.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(streetLightsOut), 3)
## Warning in st_centroid.sf(vars_net): st_centroid assumes attributes are
## constant over geometries of x
vars_net$Sanitation.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(sanitation), 3)
## Warning in st_centroid.sf(vars_net): st_centroid assumes attributes are
## constant over geometries of x
vars_net.long.nn <-
vars_net %>%
dplyr::select(uniqueID, ends_with(".nn")) %>%
gather(Variable, value, -geometry, -uniqueID)
vars <- unique(vars_net.long.nn$Variable)
mapList <- list()
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long.nn, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange,c(mapList, ncol =2, top = "Nearest Neighbor Risk Factors by Fishnet"))
It may also be reasonable to measure distance to a single point, like the centroid of the Loop neighborhood, Chicago's Central Business District. This is done with st_distance.
loopPoint <-
neighborhoods %>%
filter(name == "Loop") %>%
st_centroid()
## Warning in st_centroid.sf(.): st_centroid assumes attributes are constant
## over geometries of x
vars_net$loopDistance =
st_distance(st_centroid(vars_net),loopPoint) %>%
as.numeric()
## Warning in st_centroid.sf(vars_net): st_centroid assumes attributes are
## constant over geometries of x
ggplot() +
geom_sf(data=vars_net, aes(fill=loopDistance)) +
scale_fill_viridis() +
labs(title="Euclidean distance to The Loop") +
mapTheme()
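Because the data are in a projected CRS with metre units, st_distance() between point geometries reduces to plain Euclidean distance. A sketch with hypothetical projected coordinates:

```r
# Two cell centroids and a single reference point (projected metres, made up)
centroids <- rbind(c(355000, 580000), c(360000, 585000))
loop      <- c(357000, 583000)

# Straight-line distance from each centroid, as st_distance() returns
loopDistance <- sqrt((centroids[, 1] - loop[1])^2 +
                     (centroids[, 2] - loop[2])^2)

loopDistance  # both about 3606 m
```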
final_net <-
left_join(food_net, st_set_geometry(vars_net, NULL), by="uniqueID")
Finally, neighborhood name and police district are spatially joined to final_net using grid cell centroids. Note that some grid cell centroids do not fall into a neighborhood, returning NA, which is then removed with na.omit. Other neighborhoods are so small that they are represented by only one grid cell.
final_net <-
st_centroid(final_net) %>%
st_join(., dplyr::select(neighborhoods, name)) %>%
st_join(., dplyr::select(policeDistricts, District)) %>%
st_set_geometry(NULL) %>%
left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
st_sf()
## Warning in st_centroid.sf(final_net): st_centroid assumes attributes are
## constant over geometries of x
final_net1 <- final_net %>% na.omit()
dplyr::select(final_net1, name, District) %>%
gather(Variable, Value, -geometry) %>%
ggplot() +
geom_sf(aes(fill = Value)) +
facet_wrap(~Variable) +
scale_fill_viridis(discrete = TRUE) +
labs(title = "Aggregate Areas") +
mapTheme() + theme(legend.position = "none")
## Warning: attributes are not identical across measure variables;
## they will be dropped
final_net.nb <- poly2nb(final_net, queen=TRUE)
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)
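With style = "W", nb2listw() row-standardizes the queen weights, so each cell's spatially lagged value is just the mean of its neighbors' values. A sketch with a hypothetical four-cell neighbor list:

```r
# Values in four cells and a made-up neighbor list:
# cell 1 neighbors cells 2 and 3, cell 2 neighbors 1 and 4, and so on
x <- c(10, 2, 4, 8)
neighbors <- list(c(2, 3), c(1, 4), c(1, 4), c(2, 3))

# Under "W" each neighbor gets weight 1/(number of neighbors),
# so the spatial lag is simply the mean of the neighboring values
spatial_lag <- sapply(neighbors, function(nb) mean(x[nb]))
spatial_lag  # 3 9 9 3
```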
ggplot() +
geom_sf(data = slice(final_net, 1483, 1484,1485, 1456, 1458,1429,1430,1431), fill = "red") +
geom_sf(data = slice(final_net, 1457), fill = "black") +
geom_sf(data = food_net, fill = NA) +
labs(title = "Example of 'Queen' contiguity") +
mapTheme()
final_net.localMorans <-
cbind(
as.data.frame(localmoran(final_net$countInspections, final_net.weights)),
as.data.frame(food_net, NULL)) %>%
st_sf() %>%
dplyr::select(Inspection_Count = countInspections,
Local_Morans_I = Ii,
P_Value = `Pr(z > 0)`) %>%
mutate(Significant_Hotspots = ifelse(P_Value <= 0.05, 1, 0)) %>%
gather(Variable, Value, -geometry)
vars <- unique(final_net.localMorans$Variable)
varList <- list()
for(i in vars){
varList[[i]] <-
ggplot() +
geom_sf(data = filter(final_net.localMorans, Variable == i), aes(fill = Value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange,c(varList, ncol = 4, top = "Local Moran's I statistics, Food Inspection Failures"))
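For reference, the local Moran's I statistic that localmoran() computes for cell i is I_i = (z_i / m2) * sum_j w_ij z_j, where z is the deviation from the mean and m2 = sum(z^2)/n. A sketch of that textbook formula on four hypothetical cells with row-standardized weights:

```r
# Four cell values and a made-up row-standardized weights matrix
x <- c(10, 2, 4, 8)
W <- matrix(c(0, .5, .5, 0,
              .5, 0, 0, .5,
              .5, 0, 0, .5,
              0, .5, .5, 0), nrow = 4, byrow = TRUE)

z  <- x - mean(x)          # deviations from the mean
m2 <- sum(z^2) / length(x) # second moment
Ii <- (z / m2) * as.vector(W %*% z)

Ii  # -1.2 -1.2 -0.6 -0.6
```

Positive I_i indicates a cell that is similar to its neighbors (a potential hot or cold spot); here every cell differs from its neighbors, so all values are negative.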
final_net <-
final_net %>%
mutate(Inspection.isSig = ifelse(localmoran(final_net$countInspections,
final_net.weights)[,5] <= 0.0000001, 1, 0)) %>%
mutate(Inspection.isSig.dist = nn_function(st_coordinates(st_centroid(final_net)),
st_coordinates(st_centroid(
filter(final_net, Inspection.isSig == 1))), 1 ))
## Warning in st_centroid.sf(final_net): st_centroid assumes attributes are
## constant over geometries of x
## Warning in st_centroid.sf(filter(final_net, Inspection.isSig == 1)):
## st_centroid assumes attributes are constant over geometries of x
ggplot() +
geom_sf(data = final_net, aes(fill = Inspection.isSig.dist)) +
scale_fill_viridis() +
labs(title = "Distance to highly significant local Food Inspection Failures hotspots") +
mapTheme()
correlation.long <-
st_set_geometry(final_net, NULL) %>%
dplyr::select(-uniqueID, -cvID, -loopDistance, -name, -District) %>%
gather(Variable, Value, -countInspections)
correlation.cor <-
correlation.long %>%
group_by(Variable) %>%
summarize(correlation = cor(Value, countInspections, use = "complete.obs"))
ggplot(correlation.long, aes(Value, countInspections)) +
geom_point(size = 0.1) +
geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
x=-Inf, y=Inf, vjust = 1, hjust = -.1) +
geom_smooth(method = "lm", se = FALSE, colour = "#a2d7d8") +
facet_wrap(~Variable, ncol = 2, scales = "free") +
labs(title = "Inspection count as a function of risk factors")
ggplot(final_net, aes(countInspections)) +
geom_histogram(binwidth = 1) +
labs(title = "Distribution of Food Inspection Failures by grid cell")
reg.vars <- c("EnvironmentalComplaints.nn", "OrdinanceViolations.nn", "Business.nn",
"LiquorRetail.nn", "StreetLightsOut.nn", "Sanitation.nn", "loopDistance")
reg.ss.vars <- c("EnvironmentalComplaints.nn", "OrdinanceViolations.nn", "Business.nn", "LiquorRetail.nn", "StreetLightsOut.nn", "Sanitation.nn", "loopDistance",
"Inspection.isSig", "Inspection.isSig.dist")
crossValidate <- function(dataset, id, dependentVariable, indVariables) {
allPredictions <- data.frame()
cvID_list <- unique(dataset[[id]])
for (i in cvID_list) {
thisFold <- i
cat("This hold out fold is", thisFold, "\n")
fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, indVariables, dependentVariable)
fold.test <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, indVariables, dependentVariable)
regression <-
glm(countInspections ~ ., family = "poisson",
data = fold.train %>%
dplyr::select(-geometry, -id))
thisPrediction <-
mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
allPredictions <-
rbind(allPredictions, thisPrediction)
}
return(st_sf(allPredictions))
}
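Because countInspections is a non-negative count, crossValidate() fits a Poisson GLM in each fold. A self-contained sketch of that glm() call on synthetic data (the risk variable and coefficients below are made up for illustration):

```r
set.seed(42)
# Synthetic data: counts generated from a single made-up risk factor
d <- data.frame(risk = runif(200, 0, 5))
d$count <- rpois(200, lambda = exp(0.2 + 0.4 * d$risk))

# Poisson regression, as inside crossValidate()
fit <- glm(count ~ risk, family = "poisson", data = d)

# type = "response" returns predictions on the count scale, exp(X %*% beta),
# so predicted counts are always positive
pred <- predict(fit, newdata = data.frame(risk = c(1, 4)), type = "response")
```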
reg.cv <- crossValidate(
dataset = final_net,
id = "cvID",
dependentVariable = "countInspections",
indVariables = reg.vars) %>%
dplyr::select(cvID = cvID, countInspections, Prediction, geometry)
## This hold out fold is 26
## This hold out fold is 25
## This hold out fold is 91
## ...
reg.ss.cv <- crossValidate(
dataset = final_net,
id = "cvID",
dependentVariable = "countInspections",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = cvID, countInspections, Prediction, geometry)
reg.spatialCV <- crossValidate(
dataset = final_net%>%na.omit(),
id = "name",
dependentVariable = "countInspections",
indVariables = reg.vars) %>%
dplyr::select(cvID = name, countInspections, Prediction, geometry)
## This hold out fold is Riverdale
## This hold out fold is Hegewisch
## This hold out fold is West Pullman
## ...
reg.ss.spatialCV <- crossValidate(
dataset = final_net%>%na.omit(),
id = "name",
dependentVariable = "countInspections",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = name, countInspections, Prediction, geometry)
reg.summary <-
rbind(
mutate(reg.cv, Error = countInspections - Prediction,
Regression = "Random k-fold CV: Just Risk Factors"),
mutate(reg.ss.cv, Error = countInspections - Prediction,
Regression = "Random k-fold CV: Spatial Structure"),
mutate(reg.spatialCV, Error = countInspections - Prediction,
Regression = "Spatial LOGO-CV: Just Risk Factors"),
mutate(reg.ss.spatialCV, Error = countInspections - Prediction,
Regression = "Spatial LOGO-CV: Spatial Structure")) %>%
st_sf()
The figure below maps predictions for each regression. Importantly, the models seem to generalize equally well regardless of the cross-validation type. It is clear, however, that the models with the spatial structure features do a better job picking up on very localized hotspots. That was the goal of these features, but it is particularly notable that in the spatial cross-validation case, other local hotspots help predict those in the neighborhoods left out.
grid.arrange(
reg.summary %>%
ggplot() +
geom_sf(aes(fill = Prediction)) +
facet_wrap(~Regression) +
scale_fill_viridis() +
labs(title = "Predicted Food Inspection Failures by Regression") +
mapTheme() + theme(legend.position="bottom"),
filter(reg.summary, Regression == "Random k-fold CV: Just Risk Factors") %>%
ggplot() +
geom_sf(aes(fill = countInspections)) +
scale_fill_viridis() +
labs(title = "Observed Food Inspection Failures\n") +
mapTheme() + theme(legend.position="bottom"), ncol = 2)
Focusing now only on the spatial cross-validation models, the figure below maps prediction errors for comparison. Without the local spatial structure features, many of the hotspots are under-predicted, leaving clustered errors at those hotspots. Note that these are raw errors, not absolute errors.
filter(reg.summary, Regression == "Spatial LOGO-CV: Just Risk Factors" |
Regression == "Spatial LOGO-CV: Spatial Structure") %>%
ggplot() +
geom_sf(aes(fill = Error)) +
facet_wrap(~Regression) +
scale_fill_viridis() +
labs(title = "Food Inspection Failure errors by Regression") +
mapTheme()
st_set_geometry(reg.summary, NULL) %>%
group_by(Regression) %>%
summarize(MAE = round(mean(abs(Prediction - countInspections), na.rm = T),2),
SD_MAE = round(sd(abs(Prediction - countInspections), na.rm = T),2)) %>%
kable(caption = "MAE by regression") %>%
kable_styling("striped", full_width = F) %>%
row_spec(2, color = "black", background = "#FDE725FF") %>%
row_spec(4, color = "black", background = "#FDE725FF")
| Regression | MAE | SD_MAE |
|---|---|---|
| Random k-fold CV: Just Risk Factors | 1.03 | 1.50 |
| Random k-fold CV: Spatial Structure | 0.91 | 1.22 |
| Spatial LOGO-CV: Just Risk Factors | 1.11 | 1.54 |
| Spatial LOGO-CV: Spatial Structure | 0.97 | 1.24 |
st_set_geometry(reg.summary, NULL) %>%
group_by(Regression) %>%
mutate(Inspection_Decile = ntile(countInspections, 10)) %>%
group_by(Regression, Inspection_Decile) %>%
summarize(meanObserved = mean(countInspections, na.rm=T),
meanPrediction = mean(Prediction, na.rm=T)) %>%
gather(Variable, Value, -Regression, -Inspection_Decile) %>%
ggplot(aes(Inspection_Decile, Value, shape = Variable)) +
geom_point(size = 2) + geom_path(aes(group = Inspection_Decile), colour = "black") +
scale_shape_manual(values = c(2, 17)) +
facet_wrap(~Regression) + xlim(0,10) +
labs(title = "Predicted and observed food inspection failures by observed food inspection failures decile")
tracts17 <-
get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E"),
year = 2017, state=17, county=031, geometry=T) %>%
st_transform(102271) %>%
dplyr::select(variable, estimate, GEOID) %>%
spread(variable, estimate) %>%
rename(TotalPop = B01001_001,
NumberWhites = B01001A_001) %>%
mutate(percentWhite = NumberWhites / TotalPop,
raceContext = ifelse(percentWhite > .5, "Majority_White", "Majority_Non_White")) %>%
.[neighborhoods,]
Like Boston, Chicago is a very segregated city, as the map below shows.
ggplot() +
geom_sf(data = tracts17, aes(fill = raceContext)) +
scale_fill_viridis(discrete = TRUE) +
labs(title = "Race Context", name="Race Context") +
mapTheme()
Not surprisingly, the model with the spatial structure features not only reports lower errors but also a much smaller difference in errors across neighborhood racial context. Next, let's contrast this finding with the algorithm's overall usefulness as an inspection resource allocation tool.
final_reg <-
filter(reg.summary, Regression == "Spatial LOGO-CV: Spatial Structure" |
Regression == "Spatial LOGO-CV: Just Risk Factors") %>%
mutate(uniqueID = rownames(.))
final_reg.tracts <-
st_join(st_centroid(final_reg), tracts17) %>%
st_set_geometry(NULL) %>%
left_join(dplyr::select(final_reg, uniqueID)) %>%
st_sf() %>%
na.omit()
## Warning in st_centroid.sf(final_reg): st_centroid assumes attributes are
## constant over geometries of x
st_set_geometry(final_reg.tracts, NULL) %>%
group_by(Regression, raceContext) %>%
summarize(mean.Error = mean(Error, na.rm = T)) %>%
spread(raceContext, mean.Error) %>%
kable(caption = "Mean Error by neighborhood racial context") %>%
kable_styling("striped", full_width = F)
| Regression | Majority_Non_White | Majority_White |
|---|---|---|
| Spatial LOGO-CV: Just Risk Factors | -0.1055619 | 0.1182874 |
| Spatial LOGO-CV: Spatial Structure | 0.0035883 | 0.0001193 |
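One way to summarize generalizability across racial contexts is the absolute gap in mean error between the two contexts, per regression. The sketch below is standalone for illustration: the `tribble` values are copied from the rendered table above rather than recomputed from `final_reg.tracts`.

```r
library(tidyverse)

# Mean errors copied from the rendered "Mean Error by neighborhood racial context" table
errors_by_context <- tribble(
  ~Regression,                          ~Majority_Non_White, ~Majority_White,
  "Spatial LOGO-CV: Just Risk Factors",          -0.1055619,       0.1182874,
  "Spatial LOGO-CV: Spatial Structure",           0.0035883,       0.0001193)

# The smaller the gap, the more evenly the model generalizes across contexts
errors_by_context %>%
  mutate(Error_Gap = abs(Majority_White - Majority_Non_White))
# Just Risk Factors gap ~ 0.224; Spatial Structure gap ~ 0.0035
```

The spatial structure model's gap is roughly two orders of magnitude smaller, which is the basis for the generalizability claim below.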
#install.packages("raster")
library(raster)
burg_ppp <- as.ppp(st_coordinates(foodInspections), W = st_bbox(final_net))
## Warning: 8 points were rejected as lying outside the specified window
burg_KD.1000 <- spatstat::density.ppp(burg_ppp, 1000)
burg_KD.1500 <- spatstat::density.ppp(burg_ppp, 1500)
burg_KD.2000 <- spatstat::density.ppp(burg_ppp, 2000)
burg_KD.df <- rbind(
mutate(data.frame(rasterToPoints(mask(raster(burg_KD.1000), as(neighborhoods, 'Spatial')))), Legend = "1000 Ft."),
mutate(data.frame(rasterToPoints(mask(raster(burg_KD.1500), as(neighborhoods, 'Spatial')))), Legend = "1500 Ft."),
mutate(data.frame(rasterToPoints(mask(raster(burg_KD.2000), as(neighborhoods, 'Spatial')))), Legend = "2000 Ft."))
burg_KD.df$Legend <- factor(burg_KD.df$Legend, levels = c("1000 Ft.", "1500 Ft.", "2000 Ft."))
ggplot(data=burg_KD.df, aes(x=x, y=y)) +
geom_raster(aes(fill=layer)) +
facet_wrap(~Legend) +
coord_sf(crs=st_crs(final_net)) +
scale_fill_viridis() +
mapTheme()
The code block below creates a kernel density with a 1,000-foot search radius. as.ppp converts the food inspection failure coordinates to a ppp class unique to the spatstat package. The density.ppp function creates the kernel density. To map the results, the ppp is converted to a data frame and then to an sf layer. Points are spatially joined to final_net and the mean density is taken. Here, density is visualized with a sample_n of points overlaid on top.
burg_ppp <- as.ppp(st_coordinates(foodInspections), W = st_bbox(final_net))
## Warning: 8 points were rejected as lying outside the specified window
burg_KD <- spatstat::density.ppp(burg_ppp, 1000)
as.data.frame(burg_KD) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
aggregate(., final_net, mean) %>%
ggplot() +
geom_sf(aes(fill=value)) +
geom_sf(data = sample_n(foodInspections, 1500), size = .5) +
scale_fill_viridis() +
mapTheme()
Next, a goodness of fit indicator is created to illustrate whether the risk predictions capture more of the observed food inspection failures than the kernel density does. If so, the risk prediction model provides a more robust targeting tool for allocating inspection resources. Here are the steps.
# Compute kernel density
Inspec_ppp <- as.ppp(st_coordinates(foodInspections), W = st_bbox(final_net))
## Warning: 8 points were rejected as lying outside the specified window
Inspec_KD <- spatstat::density.ppp(Inspec_ppp, 1000)
# Convert kernel density to grid cells taking the mean
Inspec_KDE_sf <- as.data.frame(Inspec_KD) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
aggregate(., final_net, mean) %>%
#Mutate the Risk_Category field as defined below.
mutate(label = "Kernel Density",
Risk_Category = ntile(value, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
#Bind to a layer where test set crime counts are spatially joined to the fisnnet.
bind_cols(
aggregate(
dplyr::select(foodInspections) %>% mutate(InspecCount = 1), ., length) %>%
mutate(InspecCount = replace_na(InspecCount, 0))) %>%
#Select the fields we need
dplyr::select(label, Risk_Category, InspecCount)
head(Inspec_KDE_sf)
## Simple feature collection with 6 features and 3 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: 359164.1 ymin: 552875.4 xmax: 362164.1 ymax: 553375.4
## epsg (SRID): NA
## proj4string: +proj=tmerc +lat_0=36.66666666666666 +lon_0=-88.33333333333333 +k=0.9999749999999999 +x_0=300000 +y_0=0 +ellps=GRS80 +units=m +no_defs
## label Risk_Category InspecCount geometry
## 1 Kernel Density 1% to 29% 0 POLYGON ((359164.1 552875.4...
## 2 Kernel Density 1% to 29% 0 POLYGON ((359664.1 552875.4...
## 3 Kernel Density 1% to 29% 0 POLYGON ((360164.1 552875.4...
## 4 Kernel Density 1% to 29% 0 POLYGON ((360664.1 552875.4...
## 5 Kernel Density 1% to 29% 0 POLYGON ((361164.1 552875.4...
## 6 Kernel Density 1% to 29% 0 POLYGON ((361664.1 552875.4...
Now, repeat the process for the risk predictions. Note that the predictions from the LOGO-CV model with the spatial structure features are used here.
Inspec_risk_sf <-
filter(final_reg, Regression == "Spatial LOGO-CV: Spatial Structure") %>%
mutate(label = "Risk Predictions",
Risk_Category = ntile(Prediction, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
bind_cols(
aggregate(
dplyr::select(foodInspections) %>% mutate(InspecCount = 1), ., length) %>%
mutate(InspecCount = replace_na(InspecCount, 0))) %>%
dplyr::select(label,Risk_Category, InspecCount)
For each grid cell and model type (kernel density vs. risk prediction), there is now an associated risk category and observed food inspection failure count. Below, a map of the risk categories for both model types is generated with test set points overlaid. A well-fit model should show that the highest risk category is uniquely targeted to places with a high density of food inspection failure points.
rbind(Inspec_KDE_sf, Inspec_risk_sf) %>%
gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
ggplot() +
geom_sf(aes(fill = Risk_Category), colour = NA) +
geom_sf(data = sample_n(foodInspections, 1500), size = .1, colour = "black") +
facet_wrap(~label) +
scale_fill_viridis(discrete = TRUE) +
labs(title="Comparison of Kernel Density and Risk Predictions",
subtitle="Relative to test set points (in black)") +
mapTheme()
Finally, the rate of food inspection failure points by risk category and model type is calculated and plotted. What we hope to see is that the risk predictions capture a greater share of observed food inspection failures in the highest risk category than the kernel density does. The very simple model created here narrowly edges out the kernel density in the highest risk category.
rbind(Inspec_KDE_sf, Inspec_risk_sf) %>%
st_set_geometry(NULL) %>%
gather(Variable, Value, -label, -Risk_Category) %>%
group_by(label, Risk_Category) %>%
summarize(countInspections = sum(Value)) %>%
ungroup() %>%
group_by(label) %>%
mutate(Rate_of_test_set_Failures = countInspections / sum(countInspections)) %>%
ggplot(aes(Risk_Category,Rate_of_test_set_Failures)) +
geom_bar(aes(fill=label), position="dodge", stat="identity") +
scale_fill_viridis(discrete = TRUE)
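The bar chart above reduces to a grouped share computation: within each model, failure counts are normalized to rates, and the highest-risk shares are compared. This is a minimal sketch of that arithmetic using hypothetical counts for illustration only, not the real results.

```r
library(tidyverse)

# Hypothetical grid-cell failure totals by risk category -- illustration only
toy <- tibble(
  label         = rep(c("Kernel Density", "Risk Predictions"), each = 4),
  Risk_Category = rep(c("1% to 29%", "30% to 49%", "70% to 89%", "90% to 100%"), 2),
  InspecCount   = c(10, 20, 30, 40,
                     5, 15, 30, 50))

# Normalize counts within each model, then compare the highest-risk share
toy %>%
  group_by(label) %>%
  mutate(Rate_of_test_set_Failures = InspecCount / sum(InspecCount)) %>%
  filter(Risk_Category == "90% to 100%")
# Kernel Density: 40/100 = 0.40; Risk Predictions: 50/100 = 0.50
```

Here the risk predictions would "win" because a larger share of observed failures falls in their top risk category, which is exactly the comparison the real chart makes.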
Food inspection failure was chosen as the outcome of interest because we assume it may suffer from selection bias. Ideally, inspectors would visit every food facility every day to ensure it meets safety standards. In reality, we do not have that many inspectors, and we must decide how to allocate a limited inspection workforce to find and remediate as many establishments with food hazards as possible. The problem is analogous to crime prediction: we send inspectors to "risky" areas where restaurants have a tendency to violate food safety criteria. It may also be the case that inspectors use some unobserved selection criteria when choosing which restaurants to inspect.
I would not currently recommend this algorithm for production, because spatial patterns remain in the risk predictions and more features are needed to explain the spatial structure. In the future, I will look into other spatial factors, such as the locations of shopping malls and neighborhood demographics, to improve the model. If these neighborhood characteristics significantly improve prediction accuracy, then I would recommend putting this algorithm into production.